library(readr)
getwd()
[1] "C:/Users/fmojt/Documents/RProjects/R_practicals/practicum_1.2"
fifa_df = read_csv("players_22.csv")
Rows: 19239 Columns: 110
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (48): player_url, short_name, long_name, player_positions, club_name, league_name, club_position, club_loaned_from, nationality...
dbl (60): sofifa_id, overall, potential, value_eur, wage_eur, age, height_cm, weight_kg, club_team_id, league_level, club_jersey_nu...
date (2): dob, club_joined
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(fifa_df)
Here we reduce the dataset by selecting only the columns we need in this practicum.
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
cols <- c("short_name", "league_name", "preferred_foot", "player_positions", "overall", "value_eur", "wage_eur", "dob", "potential")
fifa_df_reduced <- fifa_df %>%
select(all_of(cols))
fifa_df_reduced
The first step is to count the left-footed players in each league.
library(dplyr)
# fifa_df %>%
# filter(preferred_foot == "Left") %>%
# group_by(league_name) %>%
# summarise(left_footed_pl_count = n()) %>%
# print()
# more concise alternative: count() combines group_by() and summarise(n())
fifa_df_reduced %>%
filter(preferred_foot == 'Left') %>%
count(league_name, name = 'left_footed_pl_count')
Based on the visualization below, left-footed players make up approximately 25% of the players in a league on average.
library(ggplot2)
fifa_df_reduced %>%
count(league_name, preferred_foot) %>%
ggplot(aes(x = league_name, y = n, fill = preferred_foot)) +
geom_bar(stat = "identity", position = "fill") + # "fill" makes it a proportion (stacked)
labs(title = "Proportion of Left-Footed Players in Each League",
x = "League",
y = "Proportion") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
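The share read off the plot can also be checked numerically. The following is a minimal sketch on an invented toy data frame (`toy_players` and its rows are assumptions, not part of the FIFA data); it computes the left-footed share per league and then the unweighted average of those shares.

```r
library(dplyr)

# Toy stand-in for fifa_df_reduced; the rows are invented for illustration.
toy_players <- tibble(
  league_name    = c("League A", "League A", "League A", "League A",
                     "League B", "League B", "League B", "League B"),
  preferred_foot = c("Left", "Right", "Right", "Right",
                     "Left", "Left", "Right", "Right")
)

# Share of left-footed players within each league
left_share_by_league <- toy_players %>%
  group_by(league_name) %>%
  summarise(left_share = mean(preferred_foot == "Left"), .groups = "drop")

# Unweighted average of the per-league shares
avg_left_share <- mean(left_share_by_league$left_share)
```

Replacing `toy_players` with `fifa_df_reduced` should give the figure for the real data, assuming `preferred_foot` contains only "Left" and "Right".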
We use the ggplot2 library to visualize the proportions of left- and right-footed players for each position_category.
First, we need custom groupings of player positions. We bucket the positions into offensive, defensive and midfielder bins, and then add a new column that stores the category of each player.
# Define offensive, defensive and midfielder positions
offensive_positions <- c("LF", "CF", "RF", "LW", "RW", "LS", "ST", "RS")
defensive_positions <- c("LWB", "RWB", "LCB", "CB", "RCB", "LB", "RB")
midfielder_positions <- c("LM", "LAM", "CM", "RM", "RAM", "CAM", "CDM", "DM", "LDM", "RDM")
# Categorize player positions
fifa_df_reduced <- fifa_df_reduced %>%
mutate(position_category = case_when(
player_positions %in% offensive_positions ~ "Offensive",
player_positions %in% defensive_positions ~ "Defensive",
player_positions %in% midfielder_positions ~ "Midfielder",
TRUE ~ "Other"
))
fifa_df_reduced
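The `%in%` comparison above matches only when `player_positions` holds exactly one position code. If the column stores several comma-separated codes per player (an assumption about the data that should be checked first), such rows would fall into "Other". A possible variant, sketched here on invented rows, categorizes each player by the first listed position; `toy_positions` and `primary_position` are hypothetical names.

```r
library(dplyr)

# Invented rows; assumes player_positions may hold comma-separated codes such as "RW, ST".
toy_positions <- tibble(player_positions = c("RW, ST, CF", "CB", "CM, CDM", "GK"))

offensive_positions  <- c("LF", "CF", "RF", "LW", "RW", "LS", "ST", "RS")
defensive_positions  <- c("LWB", "RWB", "LCB", "CB", "RCB", "LB", "RB")
midfielder_positions <- c("LM", "LAM", "CM", "RM", "RAM", "CAM", "CDM", "DM", "LDM", "RDM")

toy_categorized <- toy_positions %>%
  mutate(
    # Keep only the text before the first comma (the first listed position)
    primary_position = trimws(sub(",.*$", "", player_positions)),
    position_category = case_when(
      primary_position %in% offensive_positions  ~ "Offensive",
      primary_position %in% defensive_positions  ~ "Defensive",
      primary_position %in% midfielder_positions ~ "Midfielder",
      TRUE ~ "Other"
    )
  )
```

Applying this variant to the real data would change the category counts, so the plots and test results in this document would no longer match.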
The second step groups the players by position category and preferred foot and counts the players in each group.
# Summarize data to get the count of players by position and preferred foot
grouping_counts <- fifa_df_reduced %>%
group_by(position_category, preferred_foot) %>%
summarise(Count = n()) %>%
ungroup()
`summarise()` has grouped output by 'position_category'. You can override using the `.groups` argument.
grouping_counts
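Raw counts are hard to compare across categories of different size, so it may help to convert them into shares within each position category. The sketch below uses invented counts in the same shape as `grouping_counts`; `toy_counts` and `share` are hypothetical names.

```r
library(dplyr)

# Invented counts in the same shape as grouping_counts
toy_counts <- tibble(
  position_category = c("Defensive", "Defensive", "Offensive", "Offensive"),
  preferred_foot    = c("Left", "Right", "Left", "Right"),
  Count             = c(30, 70, 20, 80)
)

# Share of each foot preference within its position category
toy_shares <- toy_counts %>%
  group_by(position_category) %>%
  mutate(share = Count / sum(Count)) %>%
  ungroup()
```

In this toy example the left-footed share is 0.3 among defensive and 0.2 among offensive players.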
To avoid repeating plotting code, we define a reusable function that draws a bar chart of any summary column, split by a fill variable.
library(tidyverse)
# Draw a horizontal bar chart of `count` for each level of `target`, coloured by `fill`.
# `target`, `count` and `fill` are column names passed as strings.
plot_metric <- function(df, target, count, fill, title, subtitle) {
  p <- df %>%
    ggplot(aes(x = reorder(.data[[target]], .data[[count]]),
               y = .data[[count]],
               fill = .data[[fill]])) +
    geom_col() +
    # Print each value in the middle of its bar segment
    geom_text(aes(label = .data[[count]]),
              colour = "white", position = position_stack(vjust = 0.5)) +
    scale_fill_manual(values = c('#3153a2', 'lightgrey')) +
    coord_flip() +
    labs(title = title,
         subtitle = subtitle,
         fill = fill) +
    theme(legend.position = "bottom",
          axis.title.y = element_blank(),
          axis.title.x = element_blank()) +
    guides(fill = guide_legend(title = fill))
  return(p)
}
plot_metric(grouping_counts, "position_category", "Count",
"preferred_foot",
"Comparison of Player Positions by Preferred Foot",
"Counts of Player Positions of left- and right-footed players")
In this section we analyze whether left-footed players have a higher market value and wage than their right-footed counterparts. We do this by visualizing the mean of both attributes for left- and right-footed players.
The first step is to compute the average wage and market value. To do that, we group the dataset by preferred_foot and calculate the mean of each attribute per group.
library(dplyr)
# Summarize data: average value and wage for each foot preference
modified_df <- fifa_df_reduced %>%
group_by(preferred_foot) %>%
summarise(AVG_VALUE_EUR = mean(value_eur, na.rm = TRUE),
AVG_WAGE_EUR = mean(wage_eur, na.rm = TRUE)) %>%
ungroup()
modified_df
library(tidyverse)
# Plot for AVG_VALUE_EUR
plot_metric(modified_df, "preferred_foot", "AVG_VALUE_EUR", "preferred_foot",
"Comparison of Market Value by Preferred Foot",
"Average market value (EUR) of left- and right-footed players")
# Plot for AVG_WAGE_EUR
plot_metric(modified_df, "preferred_foot", "AVG_WAGE_EUR", "preferred_foot",
"Comparison of Wage by Preferred Foot",
"Average wage (EUR) of left- and right-footed players")
We also check whether young left-footed players have a greater average potential than young right-footed players.
We treat only players born in 2004 as young, so we filter the dataset on the year of birth before averaging.
library(dplyr)
# Summarize average potential for left-footed players
left_footed_avg <- fifa_df_reduced %>%
filter(preferred_foot == 'Left' & format(dob, "%Y") == "2004") %>%
summarise(players = 'Left-footed', avg = mean(potential, na.rm = TRUE))
# Summarize average potential for right-footed players
right_footed_avg <- fifa_df_reduced %>%
filter(preferred_foot == 'Right' & format(dob, "%Y") == "2004") %>%
summarise(players = 'Right-footed', avg = mean(potential, na.rm = TRUE))
# Combine both summaries
avg_potential_df <- bind_rows(left_footed_avg, right_footed_avg)
avg_potential_df
# select(short_name, dob, potential, everything())
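The two filtered summaries above can also be produced in one grouped pipeline. This is a sketch on invented rows (`toy_young` is a hypothetical stand-in for `fifa_df_reduced`); the result keeps the `preferred_foot` column instead of the hand-written `players` labels.

```r
library(dplyr)

# Invented rows; only the three players born in 2004 should survive the filter.
toy_young <- tibble(
  preferred_foot = c("Left", "Left", "Right", "Right", "Right"),
  dob            = as.Date(c("2004-01-15", "2003-06-01", "2004-03-10",
                             "2004-11-30", "1999-05-05")),
  potential      = c(80, 90, 70, 76, 60)
)

avg_potential_toy <- toy_young %>%
  filter(format(dob, "%Y") == "2004") %>%   # keep players born in 2004
  group_by(preferred_foot) %>%
  summarise(avg = mean(potential, na.rm = TRUE), .groups = "drop")
```

If this form is used with the plot, the `x` and `fill` aesthetics would need to refer to `preferred_foot` rather than `players`.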
Finally, we visualize the means in a simple bar plot below.
library(ggplot2)
# Create a bar plot comparing the average potential for left and right-footed players
ggplot(avg_potential_df, aes(x = players, y = avg, fill = players)) +
geom_bar(stat = "identity") +
labs(title = "Comparison of Average Potential of Left- and Right-Footed Players",
x = "Footedness",
y = "Average Potential") +
scale_fill_manual(values = c('#3153a2', "lightgrey")) +
theme_minimal() +
theme(legend.position = "none")
We hypothesized that left-footed players generally play in more offensive positions than right-footed players. Based on the counts in this document, we reject this hypothesis, since the largest share of left-footed players plays in defensive positions.
We also hypothesized that left-footed players have a higher average market value and wage than right-footed players. The group means support this claim, but the size of the difference still has to be assessed.
Finally, we hypothesized that young left-footed players have greater potential. The averages point in that direction, so we can accept this hypothesis provided the observed difference is statistically significant.
The first hypothesis can be tested with a Chi-square test of independence. The remaining hypotheses can be tested with an independent two-sample t-test, since we are comparing the means of independent groups. R's t.test() runs Welch's t-test by default, which does not assume equal variances; Student's t-test requires var.equal = TRUE.
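As a small illustration of these tests, the sketch below runs them on invented numbers (all values and object names are assumptions). It shows that `t.test()` reports a Welch test unless `var.equal = TRUE` is set, and that a 2x2 contingency table gives one degree of freedom in `chisq.test()`.

```r
# Invented values for two independent groups
left_values  <- c(3.1, 2.8, 3.6, 4.0, 2.9, 3.3)
right_values <- c(2.5, 2.7, 3.0, 2.4, 2.9, 2.6, 2.8, 3.1)

welch_result   <- t.test(left_values, right_values)                    # default: var.equal = FALSE
student_result <- t.test(left_values, right_values, var.equal = TRUE)  # pooled variance

# Invented 2x2 table of foot preference by position category
toy_table <- matrix(c(30, 70, 20, 80), nrow = 2,
                    dimnames = list(foot = c("Left", "Right"),
                                    category = c("Defensive", "Offensive")))
chisq_result <- chisq.test(toy_table)

welch_result$method
student_result$method
chisq_result$parameter
```

The Student variant uses n1 + n2 - 2 degrees of freedom, while the Welch variant estimates them from the two sample variances, which is why the outputs in this document show non-integer values.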
library(dplyr)
fifa_df_clean <- fifa_df_reduced %>%
filter(!is.na(preferred_foot) & !is.na(position_category))
chisq.test(table(fifa_df_clean$preferred_foot, fifa_df_clean$position_category))
Pearson's Chi-squared test
data: table(fifa_df_clean$preferred_foot, fifa_df_clean$position_category)
X-squared = 196.21, df = 3, p-value < 2.2e-16
t.test(value_eur ~ preferred_foot, data = fifa_df_clean)
Welch Two Sample t-test
data: value_eur by preferred_foot
t = 2.7154, df = 7347.1, p-value = 0.006636
alternative hypothesis: true difference in means between group Left and group Right is not equal to 0
95 percent confidence interval:
99667.84 617166.95
sample estimates:
mean in group Left mean in group Right
3123795 2765378
t.test(wage_eur ~ preferred_foot, data = fifa_df_clean)
Welch Two Sample t-test
data: wage_eur by preferred_foot
t = 2.2921, df = 7476.7, p-value = 0.02193
alternative hypothesis: true difference in means between group Left and group Right is not equal to 0
95 percent confidence interval:
110.7626 1419.4680
sample estimates:
mean in group Left mean in group Right
9601.461 8836.345
t.test(potential ~ preferred_foot, data = fifa_df_clean)
Welch Two Sample t-test
data: potential by preferred_foot
t = 6.0452, df = 7797, p-value = 1.561e-09
alternative hypothesis: true difference in means between group Left and group Right is not equal to 0
95 percent confidence interval:
0.4148673 0.8130385
sample estimates:
mean in group Left mean in group Right
71.54765 70.93369
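The last test above compares potential across all players, whereas the hypothesis concerns young players only. One way to align the two, sketched here on invented rows, is to apply the same 2004 birth-year filter before calling `t.test()`; `toy_players` and `young_players` are hypothetical names, and the real analysis would start from `fifa_df_clean`.

```r
library(dplyr)

# Invented rows; three players per foot were born in 2004.
toy_players <- tibble(
  preferred_foot = rep(c("Left", "Right"), each = 4),
  dob = as.Date(c("2004-02-01", "2004-05-20", "2004-09-09", "1998-01-01",
                  "2004-03-03", "2004-07-07", "2004-12-12", "2001-06-06")),
  potential = c(78, 82, 80, 70, 74, 77, 75, 68)
)

# Keep only players born in 2004, matching the definition of "young" used earlier
young_players <- toy_players %>%
  filter(format(dob, "%Y") == "2004")

# Welch two-sample t-test on the young subset
young_test <- t.test(potential ~ preferred_foot, data = young_players)
young_test$estimate
```

The result on the real data may differ from the all-player test shown above, so it should be run before drawing a conclusion about young players.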